Skip to content

Conversation

@vermouth1992
Copy link
Collaborator

Reverts #3769

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request reverts a previous fix for a serialization issue with the asynchronous reward function. While the original fix may have had drawbacks, this revert re-introduces a critical bug that can cause training to crash under certain configurations. Specifically, if asynchronous reward computation is enabled along with sandbox fusion, passing self.reward_fn to a remote Ray function will fail due to it containing a non-serializable multiprocessing.Semaphore.

future_reward = compute_reward_async.remote(
data=batch, config=self.config, tokenizer=self.tokenizer
)
future_reward = compute_reward_async.remote(data=batch, reward_fn=self.reward_fn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This revert re-introduces a critical serialization issue. The self.reward_fn object can be non-serializable by Ray, especially when sandbox_fusion is enabled, as it may contain a multiprocessing.Semaphore (see verl/trainer/ppo/reward.py:135). Passing a non-serializable object to a remote function will cause the training to crash with a SerializationError.

The previous implementation, although using a deprecated pattern, correctly avoided this issue by reconstructing the reward_fn inside the remote worker. Reverting this fix without addressing the underlying serializability of self.reward_fn is a regression.

A more robust solution would be to ensure reward_fn is always serializable, for example by using ray.util.concurrency.Semaphore instead of multiprocessing.Semaphore. However, in the absence of that fix, the previous code is safer.

Suggested change
future_reward = compute_reward_async.remote(data=batch, reward_fn=self.reward_fn)
future_reward = compute_reward_async.remote(
data=batch, config=self.config, tokenizer=self.tokenizer
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant